Pick one software call, step through each hardware block, and track both control ownership and data movement. Each panel starts with plain-language behavior, then maps that behavior to the low-level details used during embedded Linux bring-up and debugging.
idle
Unified call flow across CPU, GPU, DMA, RAM, MMU, and IOMMU
Run one scenario at a time and track who owns control and where data moves. This is where you can catch coherence bugs, translation faults, DMA stalls, and PCIe (Peripheral Component Interconnect Express) backpressure.
Flow guidance: pick a scenario and follow the highlighted block. The active step tells you who owns control and which unit is moving bytes.
scenario runs
0
fault events
0
dma transactions
0
estimated latency
0 us
Current call: idle
Control ownership: idle
Data path: idle
How to use this map: start with one call, then walk stage by stage. When something is unclear, open the focused panel for that topic (cache, coherency, IOMMU, DMA sync, CAS, barriers, or bus timing).
Cache policy and hierarchy behavior (L1 focus, with L2/L3 context)
Use this as a cache math lab. Split one address into tag (which memory line), index (which set), and offset (which byte in the line). Then run traffic shapes to see locality, conflict thrash, and how write policy changes DRAM pressure.
Write policy:
On write miss:
Addr (hex):
Preset traffic:
tag: bits 31..9
index: bits 8..6
offset: bits 5..0
Decoded: tag=0x00028, set=0x5, offset=0x00
Address split with formulas: for a 64-byte line and 8 sets, tag = addr >> 9, set = (addr >> 6) & 0x7, offset = addr & 0x3F. tag tells which memory line this is, set picks one cache row, and offset picks the byte inside the 64-byte line.
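The split formulas above can be written out directly. A minimal sketch of the 8-set, 64-byte-line teaching model (the function names are illustrative, not a real API):

```c
#include <stdint.h>

/* Teaching-model split for a 64-byte line, 8-set cache:
 * offset = addr[5:0], set = addr[8:6], tag = addr[31:9]. */
static inline uint32_t cache_tag(uint32_t addr)    { return addr >> 9; }
static inline uint32_t cache_set(uint32_t addr)    { return (addr >> 6) & 0x7; }
static inline uint32_t cache_offset(uint32_t addr) { return addr & 0x3F; }
```

Feeding in addr = 0x5140 reproduces the decoded example (tag 0x28, set 5, offset 0), and a stride of 0x200 changes the tag while leaving the set unchanged, which is exactly the thrash pattern.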
Policy quick read: Write-back (WB) keeps dirty lines local until eviction. Write-through (WT) sends every store to DRAM. Write-allocate (WA) fetches missed lines before write; no-write-allocate (NWA) writes around the cache on miss.
What each cache table column means: set is the index-selected row. way0..way3 are the four slots checked in parallel. Each slot prints tag; D means dirty (modified in cache, not written back yet). A hit means one way in that row has both valid=1 and matching tag.
Why the hardware diagram uses different bit ranges: the top diagram uses a 48-bit physical-address style notation (tag[47:14], index[13:6]) to show real hardware wiring. The interactive table below is intentionally simplified to an 8-set teaching model (index[8:6]) so conflicts are easy to see.
Sequential walk: addresses increase in order. Early accesses miss because lines are not loaded yet, then hit rate improves because nearby bytes share the same cache line (spatial locality).
Thrash (stride 0x200): stride 0x200 keeps the same set index while changing the tag. Too many lines compete for one set, so they evict each other even though total cache capacity looks large enough.
Streaming reader / write-heavy logger: a streaming reader touches many lines once, so temporal reuse is weak. A write-heavy logger reuses a smaller hot region, so hit rate rises, but dirty-line eviction can become the bottleneck.
Random pointer chase: next addresses are data-dependent and scattered. Both spatial and temporal locality are weak, so misses stay high and policy choices (WT/NWA vs WB/WA) become visible in latency and DRAM traffic.
hits
0
misses
0
writebacks
0
dram writes
0
hit rate
n/a
Write-back + write-allocate (default): on a store miss, the core first fetches the line into cache. Stores mark the line dirty, and DRAM is updated later on eviction. This reduces DRAM traffic, but dirty data is not durable until writeback, which is why sync() and cache flushes matter.
MESI (Modified, Exclusive, Shared, Invalid) with real values and bus messages
Use one shared address X with two cores and private caches. Trigger loads and stores from each core, then watch values, state transitions, bus messages, and DRAM updates in order.
MESI (Modified, Exclusive, Shared, Invalid)
Four-state tag attached to every cache line: Modified (I own the only dirty copy, RAM is stale), Exclusive (clean and mine alone, RAM matches), Shared (clean, other caches may also hold it), Invalid (line is empty or stale, must refetch). The snoop controller in each L1 watches the coherent bus and flips its own state when another core acts on the same physical address.
BusRd / BusRdX / Invalidate
The three bus messages that drive MESI. BusRd = "I want to read this line, anyone have it?" BusRdX = "I want to write it, give me exclusive ownership." Invalidate = "I already have it shared, drop your copy so I can write." On the Arm Cortex-A78AE this traffic travels over the DSU (DynamIQ Shared Unit) at the L2 boundary.
Write value:
Reading the state column: M means the only valid copy is dirty. E means the only valid copy is clean. S means multiple caches hold clean copies. I means no valid copy in that cache. State changes are driven by bus messages: BusRd (shared read), BusRdX (exclusive write ownership), BusUpgr (the Invalidate message above: shared to modified without a data fetch), and Flush (writeback to DRAM).
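The state column can be captured as a small transition function. A teaching sketch only (mesi_next and the event names are illustrative; the Flush that must accompany leaving M is noted in comments, not modeled):

```c
typedef enum { MESI_M, MESI_E, MESI_S, MESI_I } mesi_t;
typedef enum { EV_LOCAL_READ, EV_LOCAL_WRITE,
               EV_SNOOP_BUSRD, EV_SNOOP_BUSRDX } mesi_ev_t;

/* Next state for one line in one cache. `shared` is nonzero when
 * another cache also holds the line and answers the BusRd. */
mesi_t mesi_next(mesi_t s, mesi_ev_t ev, int shared) {
    switch (ev) {
    case EV_LOCAL_READ:                 /* a miss issues BusRd */
        return (s == MESI_I) ? (shared ? MESI_S : MESI_E) : s;
    case EV_LOCAL_WRITE:                /* issues BusRdX or BusUpgr first */
        return MESI_M;
    case EV_SNOOP_BUSRD:                /* M flushes dirty data, then shares */
        return (s == MESI_M || s == MESI_E) ? MESI_S : s;
    case EV_SNOOP_BUSRDX:               /* drop the copy; M flushes first */
        return MESI_I;
    }
    return s;
}
```

Walking two cores through a load/store sequence with this function reproduces the panel's state column: a cold read lands in E, a second core's read demotes it to S, and a store from either side drives M locally and I remotely.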
DMA (Direct Memory Access): doorbells, descriptor rings, and cache ownership
In most drivers the protocol is: CPU writes a descriptor ring in RAM, rings a doorbell register, then hardware issues bus reads and writes (AXI on SoC, PCIe TLPs for external devices), and finally raises interrupts. This panel shows each stage, ownership handoff, and why cache sync is required in streaming mappings.
DMA control protocol (what actually happens)
1) Driver writes descriptor entries (address, length, flags, ownership). 2) Driver rings a doorbell register so the DMA engine starts. 3) DMA engine fetches descriptors over the bus (AXI read on SoC, MRd/CplD pair on PCIe). 4) DMA engine transfers payload data between device and RAM. 5) DMA engine raises an interrupt so software can reclaim buffers. In double-buffer and scatter-gather modes, this repeats per descriptor.
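Steps 1 and 2 can be sketched as descriptor-ring bookkeeping. This is a teaching model, not a real driver: the struct layout, DESC_OWN_HW bit, and function names are all illustrative, and the barrier-plus-doorbell write that a real driver would issue is only noted in a comment:

```c
#include <stdint.h>

#define RING_SIZE   4
#define DESC_OWN_HW (1u << 31)   /* set: hardware owns this descriptor */

struct desc {
    uint64_t addr;    /* DMA (bus) address of the payload buffer */
    uint32_t len;
    uint32_t flags;   /* ownership + control bits */
};

/* Driver side: post one buffer at the producer index.
 * Returns 0 on success, -1 if hardware still owns the slot (ring full). */
int ring_post(struct desc *ring, unsigned *prod, uint64_t addr, uint32_t len) {
    struct desc *d = &ring[*prod % RING_SIZE];
    if (d->flags & DESC_OWN_HW)
        return -1;                 /* backpressure: wait for a completion */
    d->addr  = addr;
    d->len   = len;
    d->flags = DESC_OWN_HW;        /* hand ownership to hardware last */
    /* real driver: dma_wmb(); writel(*prod, doorbell_reg); */
    (*prod)++;
    return 0;
}

/* Model of the engine completing one descriptor and returning ownership. */
void ring_complete(struct desc *ring, unsigned *cons) {
    ring[*cons % RING_SIZE].flags &= ~DESC_OWN_HW;
    (*cons)++;
}
```

The ownership bit is written last on purpose: once DESC_OWN_HW is visible, the engine may fetch the descriptor at any time, so the address and length must already be in place.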
Interrupt flags used here
HT = Half Transfer, TC = Transfer Complete, TE = Transfer Error, FE = FIFO Error. These are status bits your interrupt handler checks before it decides whether to queue the next buffer, retry, or reset the engine.
Mapping:
Mode:
Coherent mapping: allocation comes from a non-cacheable region. CPU writes bypass L1 and land in DRAM immediately, so the device sees current data without explicit cache sync. This is common for descriptor rings and doorbell shadow data. For streaming buffers, ownership must be handed off explicitly with dma_sync_single_for_device() before DMA and dma_sync_single_for_cpu() after DMA.
This panel follows one translation chain end to end: VA (Virtual Address) to PA (Physical Address) for CPU loads, and bus/device addresses to host memory for DMA. In virtual machines, the chain becomes GVA (Guest Virtual Address) → IPA (Intermediate Physical Address) → HPA (Host Physical Address).
Process VA (MMU walk)
HPA view: stage-2 off, so HPA = PA.
IOMMU / stage-2
TTBR0_EL1 means Translation Table Base Register 0 at EL1. It points to the start address of the current process's stage-1 L0 table (the first page-table level used by the CPU walker).
L0/L1/L2/L3/off fields are the VA slices consumed by each walk step: L0 selects an entry in the L0 table, that entry points to L1, L1 points to L2, and L2 points to L3. The L3 entry provides the page frame number; off[11:0] picks the byte inside the 4 KB page.
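The slice arithmetic above can be written out for the common 4 KB-granule, 48-bit VA configuration (each level consumes 9 bits; the function names here are illustrative):

```c
#include <stdint.h>

/* 4 KB granule, 48-bit VA: L0..L3 each index a 512-entry table,
 * the low 12 bits select the byte inside the page. */
static inline unsigned va_l0(uint64_t va)  { return (va >> 39) & 0x1FF; }
static inline unsigned va_l1(uint64_t va)  { return (va >> 30) & 0x1FF; }
static inline unsigned va_l2(uint64_t va)  { return (va >> 21) & 0x1FF; }
static inline unsigned va_l3(uint64_t va)  { return (va >> 12) & 0x1FF; }
static inline unsigned va_off(uint64_t va) { return (unsigned)(va & 0xFFF); }
```

Composing a VA from known indices and extracting them back is a quick sanity check when decoding translation-fault addresses by hand.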
ASID stands for Address Space Identifier. The TLB key is typically {ASID, VPN}, so mappings from different processes can coexist without full TLB flushes on every context switch.
IOMMU side maps device-visible addresses (IOVA) into HPA (Host Physical Address). This bounds DMA to approved ranges and blocks rogue descriptors from writing outside the assigned memory window.
Stage-2 cost: without virtualization, one TLB miss needs up to four DRAM reads (one per page-table level). With stage-2 enabled, those reads can themselves need translation, so a cold miss may require many more memory accesses. Hardware handles this with nested walkers: AMD NPT (Nested Page Tables), Intel EPT (Extended Page Tables), and ARM stage-2.
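The worst-case count follows the standard nested-walk formula: each of the n1 stage-1 descriptor reads, plus the final output address, needs its own n2-level stage-2 walk. A sketch (walk_cost is an illustrative name):

```c
/* Worst-case memory accesses for one cold TLB miss with n1 stage-1
 * levels and n2 stage-2 levels: (n1 + 1) * (n2 + 1) - 1.
 * With n2 = 0 (no virtualization) this reduces to n1. */
int walk_cost(int n1, int n2) {
    return (n1 + 1) * (n2 + 1) - 1;
}
```

For the 4-level/4-level case this gives 24 accesses against 4 without stage-2, which is why walk caches and large TLBs matter so much under virtualization.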
Memory ordering: two tests that separate the models
The flag handshake separates relaxed ordering from stronger modes. To compare acquire/release with seq_cst, run the store-buffer (Dekker-style) test: each thread writes one variable, then reads the other. Without a full fence, both reads can legally return zero on weakly ordered systems.
Test:
Ordering:
Flag handshake: producer writes data, then raises flag. Consumer spins on flag, then reads data.
Store buffer test: each core stores to one variable, then loads the other. Weak ordering may let both reads see old zeros.
runs
0
failures
0
fail rate
n/a
Pick a test and memory ordering, then run it. The result shows how weakly ordered CPUs (ARM, RISC-V) can reorder operations unless you use stronger ordering or fences.
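The store-buffer test maps directly onto C11 atomics. A hedged sketch (sb_failures is an illustrative name): with memory_order_seq_cst the both-zero outcome is forbidden by the language, so the failure counter must stay at zero; downgrading both operations to memory_order_relaxed makes nonzero counts possible on weakly ordered hardware:

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stddef.h>

static atomic_int x, y;
static int r1, r2;

static void *writer_x(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_seq_cst);
    r1 = atomic_load_explicit(&y, memory_order_seq_cst);
    return NULL;
}
static void *writer_y(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_seq_cst);
    r2 = atomic_load_explicit(&x, memory_order_seq_cst);
    return NULL;
}

/* Run the store-buffer litmus test `iters` times and count runs
 * where both loads returned 0 (forbidden under seq_cst). */
int sb_failures(int iters) {
    int fails = 0;
    for (int i = 0; i < iters; i++) {
        pthread_t a, b;
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        r1 = r2 = -1;
        pthread_create(&a, NULL, writer_x, NULL);
        pthread_create(&b, NULL, writer_y, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0)
            fails++;
    }
    return fails;
}
```

Note that a passing relaxed run proves nothing: the reordering is permitted, not required, so absence of failures on one machine is never evidence of correctness.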
Spinlock, mutex, priority inversion, and compare-and-swap
These scenarios compare spinlock and mutex behavior, priority inversion, condition variables, and lock-free CAS updates. The goal is practical debugging intuition: who is running, who is waiting, and why progress has stopped.
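A CAS-based spinlock is the smallest of these primitives and makes the acquire/release pairing visible. A minimal sketch in C11 atomics (the names spin_lock/run_counter_test are illustrative; real kernels add backoff, fairness, and preemption control):

```c
#include <stdatomic.h>
#include <pthread.h>

static atomic_int lck;          /* 0 = free, 1 = held */

static void spin_lock(void) {
    int expected = 0;
    /* CAS loop: claim the lock only if it is currently free. */
    while (!atomic_compare_exchange_weak_explicit(
               &lck, &expected, 1,
               memory_order_acquire, memory_order_relaxed))
        expected = 0;           /* CAS wrote back the observed value; retry */
}

static void spin_unlock(void) {
    atomic_store_explicit(&lck, 0, memory_order_release);
}

static long counter;            /* plain variable, protected by the lock */

static void *worker(void *arg) {
    for (int i = 0; i < 10000; i++) {
        spin_lock();
        counter++;
        spin_unlock();
    }
    return arg;
}

/* Hammer the lock from several threads; the total must be exact. */
long run_counter_test(int nthreads) {
    pthread_t t[8];
    counter = 0;
    for (int i = 0; i < nthreads; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; i++) pthread_join(t[i], NULL);
    return counter;
}
```

The acquire on lock and release on unlock are what make the plain `counter++` safe: they order the critical section against the lock word, which is the same pairing the barrier panels below discuss.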
C source → assembly → pipeline stages
Start from a short C snippet, inspect an ARM-style lowering, then step through a 5-stage pipeline (fetch, decode, execute, memory, writeback). Compare the cycle cost of an L1 miss versus a DRAM miss.
Snippet:
C source
Compiled asm (AArch64-ish)
F = Fetch: reads instruction bytes from the I-cache using the PC.
D = Decode: decodes the opcode and reads source registers.
E = Execute: ALU does math, branch compare, or address generation.
M = Memory: loads/stores data through the D-cache.
W = Writeback: writes the result back to the register file.
Pipeline timeline (cycle 0)
CPU datapath circuit view
What happens on a load miss: Execute computes the address, Memory requests L1, and L1 misses. Any dependent instruction waits for data. Real out-of-order cores may continue with independent work, but this teaching model is in-order so dependencies are easy to see.
NVIDIA GPU warp, shared memory, and __syncthreads()
This model follows NVIDIA SM execution: a warp has 32 lockstep threads, shared memory is banked, and bank conflicts serialize access. __syncthreads() is a block-wide barrier; misuse can cause stale reads or deadlock.
Access pattern:
Coalesced pattern: thread t reads base + t*4. Threads touch adjacent addresses, so the warp can issue one efficient memory transaction. This is the ideal access pattern.
Firmware notes: warp execution and bank conflicts
A warp is the smallest scheduling unit on an NVIDIA SM. All 32 threads execute one instruction together. If branches diverge (for example, 16 threads go one way and 16 go the other), the SM runs both paths serially and reconverges. In practice you see this as lower warp_execution_efficiency in Nsight Compute.
Shared memory has 32 banks that are 4 bytes wide. A bank conflict occurs when multiple threads in the same warp hit the same bank on different addresses. The hardware serializes those accesses. Common fixes are changing stride or adding padding (for example, 33 columns in 2D tiles).
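The bank rule (bank = word index mod 32) can be checked with a small host-side model. A deliberate simplification: it assumes 4-byte accesses to distinct words and ignores the hardware's broadcast of identical addresses (conflict_factor is an illustrative name):

```c
/* Worst-case serialization factor for one warp of 32 threads,
 * each accessing the 4-byte word at index word_idx[t].
 * bank = word index % 32; the factor is the largest number of
 * threads landing in one bank. 1 = conflict-free, 32 = fully serial. */
int conflict_factor(const unsigned word_idx[32]) {
    int per_bank[32] = {0};
    int worst = 0;
    for (int t = 0; t < 32; t++) {
        int bank = word_idx[t] % 32;
        if (++per_bank[bank] > worst)
            worst = per_bank[bank];
    }
    return worst;
}
```

Running it over the classic patterns shows the padding trick: a stride-1 row access is conflict-free, a column access through a 32-word-wide tile (stride 32) is fully serialized, and widening the tile to 33 words (stride 33) restores a factor of 1.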
__syncthreads() is a block-wide barrier, so every thread in the block must reach it. If one thread skips the barrier on a divergent path, behavior is undefined and often ends in a deadlock or stale reads.
For the full GPU treatment with interactive divergence and scheduler simulators, open the CUDA Guide.
Kernel memory: zones, buddy, and slab
The kernel splits physical memory into zones (DMA, DMA32, NORMAL, HIGHMEM). Buddy allocation manages page ranges, and slab allocation turns those pages into fixed-size kernel objects (for example task_struct, inode, and sk_buff).
RAM page map (changes every allocation)
Recent allocation trace
call
zone
bytes
pages
ram effect
Reading this panel: the buddy allocator manages whole pages, while slab caches carve those pages into fixed-size objects (for example kmalloc-64). The RAM page map changes after each allocation or free because ownership of pages moves between free lists, slab pages, and DMA-reserved regions. Why ZONE_DMA still exists: some devices can only DMA into low physical addresses (for example 24-bit limits), so the kernel reserves a low region where dma_alloc_coherent(..., GFP_DMA) is likely to succeed.
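The buddy allocator's split/merge bookkeeping reduces to one XOR: a free block of order k (2^k contiguous pages) merges only with the block whose page frame number differs in exactly bit k. A sketch of that arithmetic (function names are illustrative, not the kernel's):

```c
#include <stdint.h>

/* PFN of the buddy of a free block of order `order`. */
uint64_t buddy_pfn(uint64_t pfn, unsigned order) {
    return pfn ^ (1ULL << order);
}

/* Base PFN of the combined order+1 block after a merge. */
uint64_t merged_pfn(uint64_t pfn, unsigned order) {
    return pfn & ~(1ULL << order);
}
```

This is why buddy coalescing is O(1) per level: finding the merge partner is a bit flip, not a search of the free list.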
Syscall path lab: read/write/ioctl/mmap/fsync
Pick a syscall and trace the full path through trap handling, VFS, filesystem, page cache, block layer, driver, DMA, IRQ, and return to userspace. Compare control-heavy calls (ioctl), mapping calls (mmap), and durability calls (fsync/sync).
VFS (Virtual File System)
A thin indirection layer in the Linux kernel. Userspace calls read()/write()/ioctl() on a file descriptor; the syscall handler dereferences that fd to a struct file, which points to a struct file_operations vtable supplied by whichever driver or filesystem owns the inode. VFS itself does not read blocks; it dispatches. That is why the same read() can end up in ext4, in a ramdisk, or in a v4l2 camera driver.
Page cache
The kernel's RAM-resident copy of file data, indexed by (inode, offset). read() hits it on warm pages and skips the block layer entirely; write() marks pages dirty and returns before any disk I/O. fsync() is the call that actually forces dirty pages down through the block layer and the device FUA flush.
NVMe (Non-Volatile Memory Express)
NVMe is a queue-based storage protocol. The kernel posts commands into a submission queue, rings a doorbell register, the controller performs DMA to/from RAM, then raises an MSI-X interrupt on completion. In this panel, NVMe appears on read-miss, writeback, and fsync durability paths.
Why the page cache exists: DRAM access is far faster than storage I/O (nanoseconds versus microseconds or milliseconds). A page-cache hit avoids device latency. The kernel uses free RAM for cache and evicts under pressure. sync() writes back dirty pages globally, while fsync(fd) waits for one file to become durable.
Serial protocol waveforms and digital receive logic
Choose a protocol, edit one payload byte, and watch how receiver logic interprets clock edges into bits and bytes. Focus is digital-circuit behavior: sampling, shifting, framing, and commit to memory.
Protocol:
Byte:
Select a protocol to decode one frame.
Protocol engine data path (receiver view)
Clock and framing notes
MIPI CSI-2 camera pipeline
Lens → sensor → CSI-2 receiver → ISP → DMA → IOMMU → V4L2 queue → userspace. This view tracks where timing pressure appears and why frame drops occur.
Runtime pipeline state
CSI-2 packet path (simplified)
V4L2 buffer flow: one buffer is currently being filled by DMA, one is active in userspace, and the rest are queued. If userspace keeps a completed buffer too long, queue depth shrinks and CSI frames can drop with V4L2_BUF_FLAG_ERROR.
CPU ↔ GPU data path: pageable, pinned, and unified memory
A CUDA kernel can run only after input data is visible to the GPU. Transfer mode determines latency and throughput. Compare pageable, pinned, and unified memory to see where transfer time and ownership handoff are spent.
PCIe (Peripheral Component Interconnect Express)
The serial, packet-switched link between the CPU complex and the GPU. Transfers are TLPs (Transaction Layer Packets): MRd memory read, MWr memory write, Cpl/CplD completion with or without data. Each lane is a differential pair; Gen4 x16 ≈ 32 GB/s one way. Flow control is credit-based, so an overloaded receiver throttles the sender instead of dropping packets.
Doorbell / BAR
A BAR (Base Address Register) is a window into the GPU's MMIO space that the CPU can store to directly. A doorbell is a specific 32-bit register inside that window; writing to it tells the GPU "go look at the work queue you already know about." One store, one TLP, kernel launch begins.
Transfer mode:
Payload size:
Wall time: 0.000 ms · DMA bursts: 0 · Host copies: 0
Pageable host buffer: CUDA cannot DMA directly from memory that may be swapped out. The runtime first copies data into pinned staging memory, then GPU DMA reads from that staging buffer.
Firmware notes: when each mode is the right choice
Pageable memory is acceptable for one-off transfers such as startup weights or configuration blobs. The runtime first stages data through pinned memory, so profile with nsys if transfer time matters.
Pinned memory (cudaMallocHost) is best for repeated high-throughput transfers. It enables direct DMA and works well with cudaMemcpyAsync, but too much pinned memory can hurt the rest of the system because it cannot be swapped.
Unified memory (cudaMallocManaged) gives one pointer for CPU and GPU. On discrete GPUs, pages migrate on demand through UVM (Unified Virtual Memory) faults. On integrated SoCs (for example Jetson), CPU and GPU share DRAM, so this mode can avoid explicit copies.
Practical default: start with unified memory on Jetson, and start with pinned + async transfers on discrete GPUs.
An SM (Streaming Multiprocessor) is the unit that schedules warps and executes instructions. Click blocks to connect hardware units to practical performance outcomes: occupancy, memory-latency hiding, and shared-memory pressure.
Click a block in the SM diagram
Hover any sub-unit (warp scheduler, L1/Shared, tensor core, register file, FP32 lane bank, INT32 lane bank, LD/ST, SFU) to see a technical explanation.
L1/Shared split (configurable per kernel):
Why one SM matters to firmware engineers
GPU performance is easiest to debug one SM at a time: ask what this SM can issue each cycle and what stalls it.
4 warp schedulers: each scheduler picks ready warps. More ready warps means better latency hiding.
Register pressure: high registers-per-thread lowers occupancy, so fewer warps are available to hide memory latency.
L1/shared split: changing shared-memory carveout can help one kernel and hurt another.
Tensor/FP/INT pipelines: throughput depends on whether your instruction mix matches available units.
Firmware and driver teams meet this in profiling counters, launch tuning, and bring-up checks (reported SM count, cache sizes, and scheduling behavior).
Write-through, write-back, and write-combine on one timeline
Run the same store under three L1 policies. Compare when DRAM is written and what DMB ISH (Data Memory Barrier, Inner Shareable) forces to complete before later instructions continue.
WCB (Write Combine Buffer)
A small staging FIFO between the core and the coherent bus for regions marked as Normal Non-Cacheable or Device-GRE. Consecutive stores to the same cache-line-aligned range are merged into one wider burst, so four 32-bit writes to a GPU ring buffer become one 128-bit beat on AXI. The downside is ordering: a WCB can hold your store for cycles before anyone else sees it, which is why pushing a doorbell needs DMB ISHST (Data Memory Barrier, Inner Shareable, Store) to drain it.
Write-through vs write-back vs write-combine
WT (write-through) stores hit L1 and propagate to L2/DRAM immediately: simple and predictable, but higher traffic. WB (write-back) stores stay in L1 as dirty data until eviction or snoop: faster, but visibility depends on coherence. WC (write-combine) batches stores in WCB and flushes as bursts: useful for MMIO fills and GPU command buffers.
WT dram writes
0
WB dram writes
0
WC dram writes
0
WC buffer hits
0
Write-through: every store goes to L1 and memory at the same time. Simple, slow, burns DRAM bandwidth. Good for tiny instruction caches. Write-back + write-allocate: store fills the line into L1 on miss, dirties it, DRAM gets written later on eviction. One DRAM write per 64 bytes of stores to the same line. Write-combine: no allocation, no ordering within the WC buffer. Four adjacent stores coalesce into one burst. You must issue DMB ISH (Data Memory Barrier, Inner Shareable) before the device can observe the writes in program order. The GPU command ring and MIPI DSI command FIFO usually sit in WC regions.
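The write-combine accounting above can be modeled with a one-line WCB. This is a deliberate simplification (wc_bursts is an illustrative name; real WCBs hold several entries and may drain at any moment, which is exactly why the DMB is required):

```c
#include <stdint.h>
#include <stddef.h>

/* Teaching model of a one-entry write-combine buffer: consecutive
 * stores to the same 64-byte line merge into one burst; a store to
 * a different line, or the final drain (the DMB), flushes it. */
int wc_bursts(const uint32_t *addrs, size_t n) {
    int bursts = 0;
    uint32_t open_line = UINT32_MAX;      /* sentinel: buffer empty */
    for (size_t i = 0; i < n; i++) {
        uint32_t line = addrs[i] >> 6;    /* 64-byte line number */
        if (line != open_line) {
            if (open_line != UINT32_MAX)
                bursts++;                 /* flush the previous line */
            open_line = line;
        }
    }
    if (open_line != UINT32_MAX)
        bursts++;                         /* barrier drains the WCB */
    return bursts;
}
```

Four adjacent 32-bit stores cost one burst here versus four DRAM writes under write-through, and alternating between two lines defeats the combining entirely.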
DMA coherent vs streaming: same transfer goal, different ownership rules
Both modes move bytes between device and RAM. The difference is cache ownership. Coherent mappings avoid manual cache sync; streaming mappings need explicit handoff calls before and after DMA.
Coherent (dma_alloc_coherent): the pool is mapped as Device-nGnRnE or Normal-NC. nGnRnE means non-Gathering, non-Reordering, no Early write acknowledgment: writes are not merged, not reordered, and not acknowledged early. CPU stores bypass L1 and land in DRAM, so devices see fresh data without manual flushes. Cost: no cache, so CPU throughput is lower. Use this for small descriptor rings, mailboxes, and doorbell shadow data. Streaming (dma_map_single): the buffer is normal cacheable memory. Before the kick, dma_sync_single_for_device() cleans (and invalidates, if FROM_DEVICE) the lines. After the IRQ, dma_sync_single_for_cpu() invalidates stale lines so the CPU reads fresh DRAM. Miss either call and you get silent corruption. Bounce buffer: if the device DMA mask cannot reach the buffer's physical address, the DMA core copies through a low-memory staging buffer. You pay a memcpy per transfer, but the API is unchanged.
Video buffering timeline: writer pointer vs scanout pointer
In a healthy double-buffer pipeline, ISP writes the next frame into an idle buffer while display scans the current frame from a different buffer. At vsync, roles swap atomically.
displayed
0
dropped
0
tears
0
Read this as two moving pointers. Write pointer (ISP DMA) fills lines. Scan pointer (display DMA) reads lines for the panel. Double buffering keeps them on separate buffers, so no tearing. Single buffering lets them cross, so one frame can contain mixed old/new lines.
Condition variables for producer/consumer correctness
Use this when a thread should sleep until shared state changes. pthread_cond_wait releases the mutex and sleeps atomically, then re-locks before returning.
Why developers need this: busy-wait loops waste CPU, but sleeping without a condition protocol causes lost wakeups and race bugs. Always guard the wait with while(!pred) so every wakeup re-checks state under the mutex.
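The while(!pred) rule looks like this in pthreads. A minimal single-producer sketch (producer/consume and the value 42 are illustrative; a real queue would loop and track depth):

```c
#include <pthread.h>

static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static int ready;                       /* the predicate, guarded by m */
static int data_value;

static void *producer(void *arg) {
    pthread_mutex_lock(&m);
    data_value = 42;                    /* publish state under the mutex */
    ready = 1;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&m);
    return arg;
}

/* Sleeps until the predicate holds; re-checks after every wakeup. */
int consume(void) {
    pthread_t p;
    int v;
    pthread_create(&p, NULL, producer, NULL);
    pthread_mutex_lock(&m);
    while (!ready)                      /* guards against spurious wakeups */
        pthread_cond_wait(&cv, &m);     /* unlock + sleep + relock, atomically */
    v = data_value;
    pthread_mutex_unlock(&m);
    pthread_join(p, NULL);
    return v;
}
```

The while loop also covers the race where the producer signals before the consumer ever waits: ready is already 1, so the consumer never sleeps and no wakeup is lost.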
End-to-end: writel(1, BAR0 + GPIO_SET) lights an LED
Trace one memory-mapped write from instruction issue to pad toggle: memory type check, barrier ordering, interconnect transfer, SMMU (System Memory Management Unit) translation, GPIO decode, then output latch.
axi beats
0
smmu xlates
0
smmu faults
0
led state
off
Device-nGnRnE means Device, non-Gathering, non-Reordering, no Early write acknowledgment: each MMIO store is strongly ordered and not merged. DMB OSHST means Data Memory Barrier, Outer Shareable, Store; it forces prior outer-shareable stores to become visible before execution continues. The SMMU stage-2 walk then resolves the final GPIO physical address. If the SMMU context is disabled, translation faults occur and the LED stays off.
Hypervisor translation chain: GVA → IPA → HPA
In a guest VM, software at EL1 (Exception Level 1) translates GVA (Guest Virtual Address) to IPA (Intermediate Physical Address). Hardware controlled by EL2 (Exception Level 2) then translates IPA to HPA (Host Physical Address).
Stage-1 / Stage-2
Stage-1 uses TTBR0_EL1 (Translation Table Base Register 0, EL1) to walk guest page tables. Stage-2 uses VTTBR_EL2 (Virtualization Translation Table Base Register, EL2) to map guest physical space into host physical space. Hardware chains both walks; guest software cannot bypass stage-2.
Why DMA uses stage-2 too
When a PCIe device issues a DMA, the SMMU applies stage-1 (device driver's IOVA map) and then stage-2 (hypervisor's guest-physical map). This is how a passthrough NVMe can issue DMA with only guest-physical addresses without escaping the VM.
PCIe (Peripheral Component Interconnect Express) TLPs on the wire
CPU↔GPU traffic on PCIe is packetized. Send posted writes, send non-posted reads, and watch completions return. If receiver credits run out, traffic pauses until FC Update (Flow Control Update) replenishes credits.
TLP (Transaction Layer Packet)
The transport unit across PCIe. MWr is posted (fire-and-forget), MRd is non-posted (requester waits for CplD, short for Completion with Data). A doorbell is one MWr; a DMA descriptor fetch is one MRd + one CplD.
Credits
Each receiver advertises header and payload credit limits. The sender decrements local credit counters before transmit. If credits reach zero, transmission pauses. FC Update (Flow Control Update) DLLPs refill credits as receive buffers drain.
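The credit mechanism can be modeled in a few lines. A teaching sketch only (struct fc and the function names are illustrative; real PCIe tracks header and payload credits separately per TLP class, and stalled TLPs queue rather than being counted):

```c
/* Credit-based flow control: the sender transmits only while it holds
 * credits; FC Update DLLPs return credits as the receiver drains. */
struct fc {
    int credits;   /* credits currently held by the sender */
    int sent;      /* TLPs that made it onto the link */
    int blocked;   /* TLPs that had to wait for credits */
};

void fc_send(struct fc *f, int n_tlps) {
    for (int i = 0; i < n_tlps; i++) {
        if (f->credits > 0) {
            f->credits--;
            f->sent++;
        } else {
            f->blocked++;       /* link stalls: no drop, just backpressure */
        }
    }
}

void fc_update(struct fc *f, int returned) {
    f->credits += returned;     /* receiver drained buffers, credits refill */
}
```

Starting from the panel's 12 header credits, a 16-TLP burst sends 12 and stalls 4 until an FC Update arrives, which is the backpressure pattern the simulator shows.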
tlp sent
0
header credits
12
data credits (KB)
16
completions
0
Page faults: minor, major, COW, file-backed mmap
From userspace, all four faults look like a stalled load instruction. Inside the kernel, paths differ: minor faults map existing memory, major faults need I/O, COW faults duplicate shared pages, and file-backed faults pull from storage via page cache.
Minor vs major
Minor: the PTE (Page Table Entry) was not present, but the page was already in RAM (anonymous zero-fill, or file page still in page cache). Fix-up is a couple of microseconds. Major: the page was paged out to swap or was never read from disk. Fix-up requires I/O, typically milliseconds.
COW (copy-on-write)
After fork(), parent and child share anonymous pages read-only. The first write triggers a page fault with FSR.WnR=1 on a writable VMA (Virtual Memory Area) but a read-only PTE. The handler allocates a fresh page, copies data, updates the PTE writable, and retries the instruction.
Why write() returning is not on-disk: fsync, journal, FUA
A successful write() means bytes reached page cache, not durable media. Power-loss durability also needs writeback, journal commit, and device cache flush. Run each call path to see what actually happens.
Journal (jbd2)
Ext4 writes metadata changes into a circular log before touching the real inode blocks. If power is lost mid-write, recovery replays the log to restore consistency. fsync() forces both the data block and the metadata journal commit.
FUA / FLUSH
NVMe commands can carry a Force Unit Access bit that bypasses the device cache for durability, or be preceded by a Flush command that drains the cache. Without one of these, the device may return completion while bytes still sit in its volatile DRAM.